[JAX] Expert Parallelism: JAX primitives + VJPs by phu0ngng · Pull Request #3036 · NVIDIA/TransformerEngine

phu0ngng · 2026-05-22T03:01:54Z

Summary

Third PR in the TE Expert Parallelism (EP) series, built on top of #3034. Lands the JAX bindings: an XLA FFI layer over the nvte_ep_* C API, a Python wrapper with custom_vjp for autograd, mesh-aware sharding rules, a multi-process test suite, and an end-to-end MoE example. NCCL ncclEpDispatch/ncclEpCombine are exposed as XLA primitives and work with CUDA-graph capture.

Implementation

Public Python API (`transformer_engine/jax/ep.py`)

from transformer_engine.jax.ep import (
    EpHandle,        # opaque (id, handle_mem) pair from ep_prepare
    ep_bootstrap,    # one-shot per-process: init NCCL comm + nvte_ep_initialize
    ep_dispatch,     # custom_vjp-wrapped dispatch
    ep_combine,      # custom_vjp-wrapped combine
)

ep_dispatch and ep_combine are jax.custom_vjp functions: the forward pass runs the FFI primitive, and the backward pass calls the matching nvte_ep_*_bwd FFI primitive directly. The backward pass does not call ep_prepare, because the routing state is already cached in handle.mem. Note that ep_dispatch also calls ep_prepare in its forward path, which all-gathers and preprocesses the routing maps.
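
The custom_vjp structure described above can be illustrated with a single-process toy. This is a minimal sketch, not the PR's implementation: a local permutation stands in for the EP collective, and `make_toy_dispatch`, `perm`, and `inv_perm` are hypothetical names. It only shows the pattern of computing routing state once in the forward pass and reusing it as the residual in the backward pass.

```python
import jax
import jax.numpy as jnp


def make_toy_dispatch(perm):
    """Build a toy 'dispatch' whose routing is a fixed local permutation.

    Hypothetical stand-in for illustration only; no NCCL or FFI call is made.
    """
    perm = jnp.asarray(perm)

    @jax.custom_vjp
    def dispatch(tokens):
        return tokens[perm]

    def dispatch_fwd(tokens):
        # "Prepare" once in the forward pass: the inverse permutation plays
        # the role of the cached routing state.
        inv_perm = jnp.argsort(perm)
        return tokens[perm], inv_perm

    def dispatch_bwd(inv_perm, grad_out):
        # Reuse the cached routing state; nothing is recomputed here.
        return (grad_out[inv_perm],)

    dispatch.defvjp(dispatch_fwd, dispatch_bwd)
    return dispatch


dispatch = make_toy_dispatch([2, 0, 1])
tokens = jnp.arange(6.0).reshape(3, 2)
row_weights = jnp.array([[1.0], [10.0], [100.0]])

out = dispatch(tokens)
grad_tokens = jax.grad(lambda t: (dispatch(t) * row_weights).sum())(tokens)
```

The gradient of each input row is the weight of the output row it was routed to, which is what a scatter back through the cached routing should produce.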

XLA FFI bindings (`transformer_engine/jax/csrc/extensions/ep.cpp`)

Five XLA_FFI_DEFINE_HANDLER_SYMBOL entries — EpPrepareHandler, EpDispatchHandler, EpCombineHandler, EpDispatchBwdHandler, EpCombineBwdHandler — each calling the corresponding nvte_ep_* C entry point. All marked FFI_CudaGraph_Traits so they capture cleanly. handle_id is a static FFI attribute baked at jit trace time.

Primitives + Python layer (`transformer_engine/jax/cpp_extensions/ep.py`, +951 lines)

Standard TE primitive plumbing: abstract_eval (shape/dtype inference), lowering, impl, outer_primitive registration, and partitioning rules so the EP collective is treated as a single sharded op by XLA (no spurious resharding around it).
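
As a rough illustration of what the abstract_eval step computes, the sketch below derives dispatch output shapes and dtypes from input metadata without touching any data. The layout follows the dispatch abstract_eval hunk quoted in the review thread (leading + (recv_pr, hidden_dim) for tokens, leading + (recv_pr,) in float32 for weights). The function name, the dict layout, and the example sizes are hypothetical.

```python
def dispatch_output_avals(leading, recv_pr, hidden_dim, tok_dtype):
    """Return (shape, dtype) pairs for the dispatch outputs.

    Pure shape arithmetic: tuples stand in for jax.core.ShapedArray objects,
    so this runs without JAX installed.
    """
    leading = tuple(leading)
    return {
        "recv_tokens": (leading + (recv_pr, hidden_dim), tok_dtype),
        "recv_topk_weights": (leading + (recv_pr,), "float32"),
    }


# Hypothetical sizes, chosen only to make the shapes easy to read.
avals = dispatch_output_avals(
    leading=(2,), recv_pr=8, hidden_dim=16, tok_dtype="bfloat16"
)
```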

Sharding (`transformer_engine/jax/sharding.py`, +12 lines)

Adds the EP mesh axis to the global mesh resource set so downstream sharding rules can reference it.

Build wiring (`build_tools/jax.py`, +41 lines)

Threads NCCL EP linkage through the JAX transformer_engine_jax extension. No new top-level build flags; rides on the parent PR's NVTE_BUILD_WITH_NCCL_EP.

Tests & example

tests/jax/test_multi_process_ep.py (+690 lines): 13 tests covering bootstrap, ep_prepare shape/handle contracts, primitive-level dispatch/combine identity (uniform + skewed routing), custom_vjp fwd+bwd correctness, and HLO inspection (must not insert XLA collectives outside the EP FFI).
tests/jax/multi_process_launch_ep.sh: 4-rank launcher; sets XLA_FLAGS to keep XLA command-buffer capture off for the EP FFI sequence (NCCL EP graph-destroy interaction).
examples/jax/ep/ep_moe.py (+394 lines) + run_test_ep.sh: end-to-end MoE with EP, dp=ep=2 mesh, includes a ref-comparison --check that verifies fwd+bwd vs a single-process reference.
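
The launcher's exact XLA_FLAGS value is not shown in this description. As an assumption, one way to turn off XLA GPU command buffers from Python is to append `--xla_gpu_enable_command_buffer=` (with an empty value) to XLA_FLAGS before JAX initializes its backend. `append_xla_flag` is a hypothetical helper, and the launcher script may do this differently.

```python
import os


def append_xla_flag(flag, env=None):
    """Append one flag to XLA_FLAGS, keeping any flags already set."""
    env = os.environ if env is None else env
    current = env.get("XLA_FLAGS", "").split()
    if flag not in current:
        current.append(flag)
    env["XLA_FLAGS"] = " ".join(current)
    return env["XLA_FLAGS"]


# Demonstrated on a plain dict so the real process environment is untouched.
demo_env = {}
flags = append_xla_flag("--xla_gpu_enable_command_buffer=", demo_env)
flags_again = append_xla_flag("--xla_gpu_enable_command_buffer=", demo_env)
```

In a real script the call would have to run before the first `import jax` that touches a device, since XLA reads the variable at backend initialization.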

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-22T03:10:53Z

Greptile Summary

This PR lands the JAX Expert Parallelism (EP) bindings: XLA FFI handlers wrapping the nvte_ep_* C API, jax.custom_vjp-wrapped ep_dispatch/ep_combine with mesh-aware sharding rules, build wiring for the NCCL EP submodule, a multi-process test suite, and an end-to-end MoE example.

transformer_engine/jax/cpp_extensions/ep.py (+955 lines): five new primitives (EpPrepare, EpDispatch, EpCombine, EpDispatchBwd, EpCombineBwd) each with abstract_eval, lowering, impl, partition, and shardy_sharding_rule.
transformer_engine/jax/csrc/extensions/ep.cpp (+539 lines): five XLA_FFI_DEFINE_HANDLER_SYMBOL entries, NCCL communicator lifetime management via EpInstanceState.
transformer_engine/jax/ep.py (+303 lines): public ep_bootstrap, ep_dispatch and ep_combine with custom_vjp.

Confidence Score: 4/5

Safe to merge with one fix: the dispatch-backward partition function declares an output sharding with the wrong rank for grad_topk_weights, causing a JAX compile-time error in any multi-device training run that backpropagates through ep_dispatch.

The dispatch-backward partition function specifies PartitionSpec(*resolved, None) for grad_topk_weights, producing a spec one rank wider than the tensor's actual shape. Under SPMD JIT with any mesh, JAX will reject the sharding at compile time. The bug is latent in all backward-training paths and easy to hit once EP is exercised in a real training loop. The rest of the primitive stack, the C++ FFI layer, and the custom_vjp math all look correct.

transformer_engine/jax/cpp_extensions/ep.py — specifically the EpDispatchBwdPrimitive.partition method (lines 763-766).
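
The rank mismatch described above can be reproduced with plain tuples. This is a hedged illustration rather than the code under review: `resolved`, the tensor shape, and `check_spec_rank` are hypothetical, and tuples stand in for jax.sharding.PartitionSpec. The point is only that appending a trailing None to a spec that already has one entry per dimension yields a spec longer than the tensor's rank.

```python
def check_spec_rank(spec, shape):
    """Raise if a partition spec names more dimensions than the tensor has."""
    if len(spec) > len(shape):
        raise ValueError(
            f"spec {spec} has {len(spec)} entries but tensor rank is {len(shape)}"
        )
    return spec


resolved = ("ep", None)            # hypothetical: one entry per tensor dimension
grad_topk_weights_shape = (4, 2)   # hypothetical rank-2 tensor

good_spec = check_spec_rank(resolved, grad_topk_weights_shape)

try:
    check_spec_rank((*resolved, None), grad_topk_weights_shape)
    rank_error = None
except ValueError as exc:
    rank_error = str(exc)
```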

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/jax/cpp_extensions/ep.py | New file (+955 lines): five EP primitives with abstract_eval, lowering, partition, and shardy_sharding_rule. EpDispatchBwdPrimitive.partition uses PartitionSpec(*resolved, None) for grad_topk_weights, producing a spec one rank wider than the tensor, causing a JAX error at SPMD compile time. |
| transformer_engine/jax/csrc/extensions/ep.cpp | New file (+539 lines): five XLA FFI handlers plus NCCL comm lifetime management. topk_weights unconditionally wrapped as DType::kFloat32 without dtype validation (flagged in previous review). Otherwise structurally sound. |
| transformer_engine/jax/ep.py | New file (+303 lines): public ep_bootstrap with input validation, ep_dispatch/ep_combine with custom_vjp. VJP math looks correct; sharding constraint re-pinning in backward is well-handled. |
| build_tools/jax.py | Adds NCCL EP linkage with hard RuntimeError when submodule header is missing or an arch < 90 is in NVTE_CUDA_ARCHS. Inconsistency with setup.py graceful disable noted in previous review threads. |
| transformer_engine/jax/sharding.py | Adds ep_resource field to MeshResource dataclass and ep_axis_size() helper. Clean, non-breaking addition. |
| tests/jax/test_multi_process_ep.py | 690-line multi-process test suite. All tests require SM>=90 hardware, so the partition-spec bug in EpDispatchBwd would not be caught in CI without Hopper GPUs. |

Sequence Diagram

sequenceDiagram
    participant PY as Python (ep.py)
    participant Prim as JAX Primitives (cpp_extensions/ep.py)
    participant FFI as XLA FFI (ep.cpp)
    participant NCCL as NCCL EP (nvte_ep_*)

    Note over PY: ep_bootstrap()
    PY->>FFI: SetEpBootstrapParams(uid, ep_size, ...)
    FFI->>NCCL: ncclCommInitRank + nvte_ep_initialize

    Note over PY: ep_dispatch() forward
    PY->>Prim: ep_prepare(topk_idx)
    Prim->>FFI: EpPrepareHandler
    FFI->>NCCL: nvte_ep_prepare
    NCCL-->>PY: token_counts, EpHandle

    PY->>Prim: ep_dispatch_fwd(handle, tokens, topk_weights)
    Prim->>FFI: EpDispatchHandler
    FFI->>NCCL: nvte_ep_dispatch
    NCCL-->>PY: recv_tokens, recv_topk_weights

    Note over PY: Expert FFN runs on recv_tokens

    PY->>Prim: ep_combine_fwd(handle, weighted_expert_out)
    Prim->>FFI: EpCombineHandler
    FFI->>NCCL: nvte_ep_combine
    NCCL-->>PY: combined output

    Note over PY: ep_dispatch() backward
    PY->>Prim: ep_dispatch_bwd(handle, g_recv_tokens, g_recv_topk_weights)
    Prim->>FFI: EpDispatchBwdHandler
    FFI->>NCCL: nvte_ep_dispatch_bwd
    NCCL-->>PY: grad_tokens, grad_topk_weights

    PY->>Prim: ep_combine_bwd(handle, g_result)
    Prim->>FFI: EpCombineBwdHandler
    FFI->>NCCL: nvte_ep_combine_bwd
    NCCL-->>PY: grad_expert_out
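
The forward flow in the diagram can be mimicked in one process to see why dispatch followed by combine can act as an identity. This is a toy model under stated assumptions, not NCCL EP semantics: experts are Python lists, every expert is the identity function, and combine is taken to be a per-token sum of already-weighted expert outputs. All function names are hypothetical.

```python
def toy_dispatch(tokens, topk_idx, topk_weights, num_experts):
    """Route each (token, weight) pair to the experts listed in topk_idx."""
    inbox = [[] for _ in range(num_experts)]
    for src, (experts, weights) in enumerate(zip(topk_idx, topk_weights)):
        for e, w in zip(experts, weights):
            inbox[e].append((src, tokens[src], w))
    return inbox


def toy_combine(inbox, num_tokens, expert_fn=lambda x: x):
    """Sum weighted expert outputs back into each source token's slot."""
    out = [0.0] * num_tokens
    for received in inbox:
        for src, value, w in received:
            out[src] += w * expert_fn(value)
    return out


tokens = [1.0, 2.0, 3.0]
topk_idx = [[0, 1], [1, 2], [2, 0]]                     # top-2 routing, 3 experts
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]   # each row sums to 1

inbox = toy_dispatch(tokens, topk_idx, topk_weights, num_experts=3)
combined = toy_combine(inbox, num_tokens=len(tokens))
```

With identity experts and per-token weights that sum to 1, `combined` equals `tokens`, which is the kind of property a dispatch/combine identity test can check.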

Reviews (5): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

mgoldfarb-nvidia · 2026-05-22T15:15:57Z

+  }
+
+ private:
+  EpCommManager() = default;


If we use stateful FFI calls, we could tie the EP communicator to the lifetime of the JAX computation rather than the process.

Cool to learn! I will update it.

mgoldfarb-nvidia · 2026-05-22T15:17:01Z

+Error_Type EpPrepareFFI(cudaStream_t stream, Buffer_Type topk_idx, Result_Type token_counts,
+                        Result_Type handle_mem, Result_Type workspace, EpPrepareConfig config) {
+  auto topk_dims = topk_idx.dimensions();
+  NVTE_CHECK(topk_dims.size() >= 2,


nit: can we return FFI InvalidArgument instead of an NVTE_CHECK for these inputs?

This is probably a good idea. I suggest we make another follow-up MR to do so for all the FFIs.

phu0ngng · 2026-05-22T15:53:02Z

I would appreciate your help reviewing this PR, @tdophung @jberchtold-nvidia!
Please focus on the changes on the JAX side, as the TE/Common ones will be discussed in #3034.

jberchtold-nvidia · 2026-05-22T16:30:34Z

+    kernels = kernels.reshape(ep_size, NLE, *kernels.shape[1:])
+
+    @jax.jit
+    def step(idx, toks, w, lk):


What does lk stand for?

jberchtold-nvidia · 2026-05-22T16:44:11Z

+        leading = _ep_leading_dims(is_outer)
+        recv_tokens_aval = jax.core.ShapedArray(leading + (recv_pr, hidden_dim), tok_dtype)
+        recv_topk_weights_aval = jax.core.ShapedArray(leading + (recv_pr,), jnp.float32)
+        workspace_aval = jax.core.ShapedArray(topk_idx_aval.shape, jnp.int64)


Same comment as above about int64

greptile-apps · 2026-05-22T23:26:00Z

+            assert ret == 0, f"ncclGetUniqueId failed with code {ret}"
+            uid_bytes = bytes(uid_arr)


assert disabled by -O in ctypes UID path

assert ret == 0 is silently elided when Python runs under the -O optimisation flag (common in production or Numba/Conda environments). If ncclGetUniqueId fails, uid_bytes would be all zeros; the all-gather propagates those zeros to every rank in the EP group, causing ncclCommInitRank to either produce mismatched communicators or hang indefinitely with no diagnostic message.

Suggested change

-            assert ret == 0, f"ncclGetUniqueId failed with code {ret}"
-            uid_bytes = bytes(uid_arr)
+            ret = libnccl.ncclGetUniqueId(ctypes.cast(uid_arr, ctypes.c_void_p))
+            if ret != 0:
+                raise RuntimeError(f"ncclGetUniqueId failed with code {ret}")
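
A small helper makes the difference concrete. This is a sketch only: `check_nccl_result` is a hypothetical name, and a plain integer stands in for the ncclGetUniqueId return code. Unlike assert, the explicit raise still runs when Python is started with -O.

```python
def check_nccl_result(ret, call_name):
    """Raise on a nonzero return code; survives `python -O`, unlike assert."""
    if ret != 0:
        raise RuntimeError(f"{call_name} failed with code {ret}")
    return ret


# Success path: the return code is passed through unchanged.
ok = check_nccl_result(0, "ncclGetUniqueId")

# Failure path: a nonzero code becomes an exception with a diagnostic message.
try:
    check_nccl_result(3, "ncclGetUniqueId")
    message = None
except RuntimeError as exc:
    message = str(exc)
```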

phu0ngng requested review from jberchtold-nvidia and ptrendx as code owners May 22, 2026 03:01

phu0ngng mentioned this pull request May 22, 2026

[JAX] Dispatch/Combine for BF16 Tensor #3025

Open

greptile-apps Bot reviewed May 22, 2026

Comment thread build_tools/jax.py

Comment thread build_tools/jax.py

Comment thread transformer_engine/jax/cpp_extensions/ep.py

mgoldfarb-nvidia reviewed May 22, 2026

phu0ngng requested a review from tdophung May 22, 2026 15:51

jberchtold-nvidia reviewed May 22, 2026

Expert Parallelism: common C API + NCCL EP v0.1 backend

17e5126

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng force-pushed the phuong/ep-3-jax branch from eaaffbf to c502227 Compare May 22, 2026 23:08

greptile-apps Bot reviewed May 22, 2026

Comment thread transformer_engine/jax/ep.py Outdated

Comment thread transformer_engine/jax/ep.py Outdated

phu0ngng force-pushed the phuong/ep-3-jax branch from c502227 to 69b0196 Compare May 22, 2026 23:19

greptile-apps Bot reviewed May 22, 2026

phu0ngng added 4 commits May 23, 2026 19:36

Expert Parallelism: persistent ncclEpHandle cache with allow_handle_mem_reloc gating

cef4b33

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Expert Parallelism: JAX bindings (FFI, custom_vjp, multi-process tests, MoE example)

79bfcf1

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

JAX EP: tie NCCL comm lifetime to JAX executables via XLA stateful FFI

dce29de

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

JAX EP: expose allow_handle_mem_reloc as opt-in ep_bootstrap parameter

e79efa9

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng force-pushed the phuong/ep-3-jax branch from 8cb8de4 to e79efa9 Compare May 23, 2026 23:13

[pre-commit.ci] auto fixes from pre-commit.com hooks

97275da

for more information, see https://pre-commit.ci
